Last update: March 19, 2020
See also References and further reading (last slide).
\(~\)
\(~\)
\(~\)
\(~\)
Neural networks (NNs) first became popular in the late 1980s and early 1990s.
There was a shift from statistics to computer science and machine learning, since these models are highly parameterized.
Statisticians were skeptical: "It's just a nonlinear model."
After the first hype, NNs were pushed aside by boosting and support vector machines.
Revival since 2010: the emergence of deep learning, as a consequence of improved computer resources, some algorithmic innovations, and successful applications to image and video classification and to speech and text processing.
\(~\)
\(~\)
\(~\)
\(~\)
So we first need to understand: what is a neural network?
Neuron and myelinated axon, with signal flow from inputs at dendrites to outputs at axon terminals. Image credit: Egm4313.s12 (Prof. Loc Vu-Quoc), https://commons.wikimedia.org/w/index.php?curid=72816083
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
According to Chollet and Allaire (2018), page 19:

Recapitulate from Module 3 the bodyfat dataset, which contained the following variables:

- bodyfat: % of body fat
- age: age of the person
- weight: body weight
- height: body height
- neck: neck thickness
- bmi: body mass index
- abdomen: circumference of abdomen
- hip: circumference of hip

We will now model bodyfat as the response, using all other variables as covariates - this will give us a multiple linear regression (MLR) model.
Let \(n\) be the number of observations in the training set, here \(n=243\).
(from Module 3)
\(~\)
We assume \[ Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i={\boldsymbol x}_i^T{\boldsymbol\beta}+\varepsilon_i \ , \]
for \(i=1,\ldots,n\), where \(x_{ij}\) is the value of the \(j\)th predictor for the \(i\)th observation, and \({\boldsymbol\beta}^\top = (\beta_0,\beta_1,\ldots,\beta_p)\) are the regression coefficients.
\(~\)
We used the compact matrix notation for all observations \(i=1,\ldots,n\) together: \[{\boldsymbol Y}={\boldsymbol {X}} \boldsymbol{\beta}+{\boldsymbol{\varepsilon}} \ .\]
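The least-squares estimate from the matrix form above can be computed directly. A minimal sketch, using our own simulated data (not the bodyfat set):

```r
# Sketch: least-squares estimate from the matrix form (simulated data, our own)
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(2 * n), n, 2))   # design matrix with intercept column
beta <- c(1, 2, -3)                         # true coefficients for the simulation
y <- X %*% beta + rnorm(n, sd = 0.1)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
beta_hat                                    # close to the true beta
```

In practice one would of course use lm(), which solves the same normal equations in a numerically more stable way.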
Assumptions:
The classical normal linear regression model is obtained if additionally
\(~\)
How can our statistical model be represented as a network?
\(~\)
We need:
\(~\)
\(~\)
\(~\)
\(~\)
\[\begin{equation*} Y_i=\beta_0 + \beta_1 x_{i1}+\beta_2 x_{i2}+\cdots + \beta_p x_{ip}+\varepsilon_i \ . \end{equation*}\]
\(~\)
We do not say anything about what is random and what is fixed, and we do not make any assumptions about the distribution of any random variable.
In the statistics world we would have written \(\hat{y}_1({\boldsymbol x}_i)\) to specify that we are estimating a predicted value of the response for the given covariate value. To be able to distinguish this predicted response from the observed response we use the notation: \[ \hat{y}_1({\boldsymbol x}_i)=\phi_o(w_0+w_1 x_{i1}+\cdots + w_p x_{ip})\]
The only difference to our MLR model is then that we would have called the \(w\)s \(\hat{\beta}\)s instead.
\(~\)
We now translate what we did for the regression setup into the neural networks world:
Replace the parameters \(\boldsymbol\beta\) with network weights \(\boldsymbol w\).
Replace the RSS in our training data set with the following (mean squared error) \[ J({\boldsymbol w})=\frac{1}{n}\sum_{i=1}^n (y_i-{\hat{y}_1({\boldsymbol x}_i)})^2 \ ,\] where \(J({\boldsymbol w})\) indicates that the unknown parameters are the weights \({\boldsymbol w}\).
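The loss \(J({\boldsymbol w})\) above is simple to evaluate. A small sketch with our own toy data:

```r
# Sketch: evaluating the MSE loss J(w) for a candidate weight vector (toy data, our own)
J <- function(w, X, y) mean((y - X %*% w)^2)
X <- cbind(1, 1:5)                  # intercept column plus one covariate
y <- c(1.1, 2.0, 2.9, 4.1, 5.0)
J(c(0, 1), X, y)                    # loss at w = (w_0, w_1) = (0, 1); equals 0.006
```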
\(~\)
\(\rightarrow\) we need general-purpose optimization methods that also work when the loss function does not have a closed-form minimizer.
\(~\)
(https://github.com/SoojungHong/MachineLearning/wiki/Gradient-Descent)
\(~\) \(~\)
Other figures that give good illustration of the optimization problem in Chollet and Allaire (2018):
\(~\)
Here we compare
Linear regression vs. neural networks: an example.
\(~\)
fit = lm(bodyfat ~ age + weight + height + bmi + neck + abdomen + hip,
data = d.bodyfat)
fitnnet = nnet(bodyfat ~ age + weight + height + bmi + neck + abdomen +
hip, data = d.bodyfat, linout = TRUE, size = 0, skip = TRUE, maxit = 1000,
entropy = FALSE)
## # weights: 8
## initial value 1404123.130460
## iter 10 value 4470.047348
## final value 4415.453729
## converged
cbind(fitnnet$wts, fit$coefficients)
## [,1] [,2]
## (Intercept) -9.748904e+01 -9.748903e+01
## age -9.607668e-04 -9.607669e-04
## weight -6.292821e-01 -6.292820e-01
## height 3.974884e-01 3.974884e-01
## bmi 1.785331e+00 1.785330e+00
## neck -4.945725e-01 -4.945725e-01
## abdomen 8.945189e-01 8.945189e-01
## hip -1.255549e-01 -1.255549e-01
Aim is to predict if a person has diabetes. The data stem from a population of women of Pima Indian heritage in the US, available in the R MASS package. The following information is available for each woman:
\(~\)
- diabetes: 0 = not present, 1 = present
- npreg: number of pregnancies
- glu: plasma glucose concentration in an oral glucose tolerance test
- bp: diastolic blood pressure (mmHg)
- skin: triceps skin fold thickness (mm)
- bmi: body mass index (weight in kg/(height in m)\(^2\))
- ped: diabetes pedigree function
- age: age in years
\(~\)
\(~\)
(Maximum likelihood)
\[ \ln(L(\boldsymbol{\beta}))=l(\boldsymbol{\beta}) =\sum_{i=1}^n \Big ( y_i \ln p_i + (1-y_i) \ln(1 - p_i )\Big ) \ .\]
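The log-likelihood above is easy to write out in code. A sketch with our own toy values for \(y_i\) and \(p_i\):

```r
# Sketch: the Bernoulli log-likelihood for given probabilities (toy values, our own)
loglik <- function(y, p) sum(y * log(p) + (1 - y) * log(1 - p))
y <- c(1, 0, 1)
p <- c(0.9, 0.2, 0.8)
loglik(y, p)   # log(0.9) + log(0.8) + log(0.8)
```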
\(~\)
\(~\)
\[ \hat{y}_1({\boldsymbol x}_i)=\frac{1}{1+\exp(-(w_0+w_1 x_{i1}+\cdots + w_r x_{ir}))} \in (0,1) \ . \]
\(~\)
\(~\)
\(~\)
fitlogist = glm(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, family = binomial(link = "logit"))
summary(fitlogist)
##
## Call:
## glm(formula = diabetes ~ npreg + glu + bp + skin + bmi + ped +
## age, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9830 -0.6773 -0.3681 0.6439 2.3154
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.773062 1.770386 -5.520 3.38e-08 ***
## npreg 0.103183 0.064694 1.595 0.11073
## glu 0.032117 0.006787 4.732 2.22e-06 ***
## bp -0.004768 0.018541 -0.257 0.79707
## skin -0.001917 0.022500 -0.085 0.93211
## bmi 0.083624 0.042827 1.953 0.05087 .
## ped 1.820410 0.665514 2.735 0.00623 **
## age 0.041184 0.022091 1.864 0.06228 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 178.39 on 192 degrees of freedom
## AIC: 194.39
##
## Number of Fisher Scoring iterations: 5
set.seed(787879)
library(nnet)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 1000,
entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 0.1))
## # weights: 8
## initial value 213.575955
## iter 10 value 89.511044
## final value 89.195333
## converged
# entropy=TRUE because default is least squares
cbind(fitnnet$wts, fitlogist$coefficients)
## [,1] [,2]
## (Intercept) -9.773046277 -9.773061533
## npreg 0.103183171 0.103183427
## glu 0.032116832 0.032116823
## bp -0.004767678 -0.004767542
## skin -0.001917105 -0.001916632
## bmi 0.083624151 0.083623912
## ped 1.820397792 1.820410367
## age 0.041183744 0.041183529
\(~\)
By setting entropy=TRUE we minimize the cross-entropy loss.
plotnet(fitnnet)
But, there may also exist local minima.
\(~\)
set.seed(123)
fitnnet = nnet(diabetes ~ npreg + glu + bp + skin + bmi + ped + age,
data = train, linout = FALSE, size = 0, skip = TRUE, maxit = 10000,
entropy = TRUE, Wts = fitlogist$coefficients + rnorm(8, 0, 1))
## # weights: 8
## initial value 24315.298582
## final value 12526.062906
## converged
cbind(fitnnet$wts, fitlogist$coefficients)
## [,1] [,2]
## (Intercept) -36.733537 -9.773061533
## npreg -77.126994 0.103183427
## glu -2984.409175 0.032116823
## bp -1835.934259 -0.004767542
## skin -718.072629 -0.001916632
## bmi -818.561311 0.083623912
## ped -8.687473 1.820410367
## age -773.023878 0.041183529
Why can NN and logistic regression lead to such different results?
\(~\)
The iris flower data set was introduced by the British statistician and biologist Ronald Fisher in 1936.
The four covariates are Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
The aim is to predict the species of an iris plant.
\(~\)
We only briefly mentioned multiclass regression in module 4.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
The activation function for the output layer is called the softmax function. For each class \(c=1,\ldots, C\) it is given as \[ \hat{y}_c({\boldsymbol x}_i) = \frac{\exp({\boldsymbol x}_i^T{\boldsymbol w}_c)}{\sum_{s=1}^{C}\exp({\boldsymbol x}_i^T{\boldsymbol w}_s)} \ , \] where each \({\boldsymbol w}_s\) is an \((r+1)\)-dimensional vector of weights.
Note: there is some redundancy here, since \(\sum_{c=1}^C {\hat y}_{c}({\boldsymbol x}_i)=1\), so we could have had \(C-1\) output nodes, but this is not done.
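The softmax formula above can be sketched in a few lines (the scores below are our own toy values for the linear predictors \({\boldsymbol x}_i^T{\boldsymbol w}_c\)):

```r
# Sketch: softmax activation over C = 3 output nodes (toy scores, our own)
softmax <- function(z) exp(z) / sum(exp(z))
z <- c(2, 1, 0.5)          # linear predictors for the three classes
probs <- softmax(z)
round(probs, 3)
sum(probs)                 # the C probabilities always sum to 1
```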
\(~\)
Q: How many parameters are we estimating?
\(~\)
\(~\)
\(~\)
First select a training sample
\(~\)
library(nnet)
set.seed(123)
train = sample(1:150, 50)
iris_train = ird[train, ]
iris_test = ird[-train, ]
Then fit the model with nnet() (by default using the softmax activation function):
set.seed(123)
iris.nnet <- nnet(species ~ ., data = ird, subset = train, size = 0,
skip = TRUE, rang = 0.1, maxit = 200)
## # weights: 15
## initial value 56.732848
## iter 10 value 2.748260
## iter 20 value 0.012982
## final value 0.000093
## converged
How many weights have been estimated?
What does the graph look like?
\(~\)
summary(iris.nnet)
## a 4-0-3 network with 15 weights
## options were - skip-layer connections softmax modelling
## b->o1 i1->o1 i2->o1 i3->o1 i4->o1
## 113.44 -0.36 -4.82 -11.51 -6.65
## b->o2 i1->o2 i2->o2 i3->o2 i4->o2
## 7.56 20.19 23.17 -38.05 -36.15
## b->o3 i1->o3 i2->o3 i3->o3 i4->o3
## -121.04 -19.78 -18.26 49.66 42.79
Fitting the multinomial regression. This is also done with nnet, but using the wrapper multinom (which has its own settings, so results are not necessarily the same as above).
library(caret)
fit = multinom(species ~ -1 + ., family = multinomial, data = iris_train)
## # weights: 15 (8 variable)
## initial value 54.930614
## iter 10 value 4.353139
## iter 20 value 0.139411
## iter 30 value 0.065218
## iter 40 value 0.056419
## iter 50 value 0.045548
## iter 60 value 0.020867
## iter 70 value 0.016116
## iter 80 value 0.012952
## iter 90 value 0.012787
## iter 100 value 0.009090
## final value 0.009090
## stopped after 100 iterations
coef(fit)
## Sepal.L. Sepal.W. Petal.L. Petal.W.
## s -5.91976 21.30247 -12.52073 -2.774511
## v -39.93182 -28.07162 53.73326 41.264390
Problem: multinom() seems to fit an intercept plus an offset node (B), so we have to remove the intercept manually (with -1 in the formula above).
\(~\)
testclass = predict(fit, new = iris_test)
confusionMatrix(data = testclass, reference = iris_test$species)$table
## Reference
## Prediction c s v
## c 28 0 0
## s 0 36 0
## v 4 0 32
\(~\)
table(predict(iris.nnet, iris_test, type = "class"), iris_test$species)
##
## c s v
## c 31 0 7
## s 0 36 0
## v 1 0 25
For more on multinomial regression with R, check here.
\(~\)
\(~\)
\(~\)
\(~\) But:
These are only linear models (linear boundaries).
Parameters (weights) found using gradient descent algorithms where the learning rate (step length) must be set.
Connections go only forward in the network; there are no feedback connections that send the output of the model back into the network.
Examples: linear, logistic and multinomial regression, without any hidden layers (between the input and output layers).
We may have no hidden layer, one (to be studied next), or many.
Adding hidden layers with non-linear activation functions between the input and output layer will make nonlinear statistical models.
The number of hidden layers is called the depth of the network, and the number of nodes in a layer is called the width of the layer.
\(~\)
\(~\)
\(~\)
The universal approximation theorem says that a feedforward network with a linear output layer and at least one hidden layer with a "squashing" activation function (given enough hidden units)
can approximate any (Borel measurable) function from one finite-dimensional space (our input layer) to another (our output layer) with any desired non-zero amount of error.
The universal approximation theorem holds for a wide class of activation functions (e.g. the sigmoid and the ReLU)
in the hidden layer.
\(~\)
\(~\)
nnet R package\(~\)
For simplicity (since we only have one week to work with this module), we will use the nnet R package by Brian Ripley instead of the currently very popular keras package for deep learning (the keras package will be presented for completeness).
nnet fits one hidden layer with a sigmoid activation function. The implementation does not use gradient descent, but BFGS via optim.
Type ?nnet() into your R-console to see the arguments of nnet().
If the response in formula is a factor, an appropriate classification network is constructed; this has one output and entropy fit if the number of levels is two, and a number of outputs equal to the number of classes and a softmax output stage for more levels.
Objective: To predict the median price of owner-occupied homes in a given Boston suburb in the mid-1970s using 13 input variables.
This data set is both available in the MASS and keras R package.
\(~\)
(using the keras library)

library(keras)
dataset <- dataset_boston_housing()
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset
str(train_targets)
## num [1:404(1d)] 15.2 42.3 50 21.1 17.7 18.5 11.3 15.6 15.6 14.4 ...
head(train_data)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 1.23247 0.0 8.14 0 0.538 6.142 91.7 3.9769 4 307 21.0
## [2,] 0.02177 82.5 2.03 0 0.415 7.610 15.7 6.2700 2 348 14.7
## [3,] 4.89822 0.0 18.10 0 0.631 4.970 100.0 1.3325 24 666 20.2
## [4,] 0.03961 0.0 5.19 0 0.515 6.037 34.5 5.9853 5 224 20.2
## [5,] 3.69311 0.0 18.10 0 0.713 6.376 88.4 2.5671 24 666 20.2
## [6,] 0.28392 0.0 7.38 0 0.493 5.708 74.3 4.7211 5 287 19.6
## [,12] [,13]
## [1,] 396.90 18.72
## [2,] 395.38 3.11
## [3,] 375.52 3.26
## [4,] 396.90 8.01
## [5,] 391.43 14.65
## [6,] 391.13 11.74
\(~\)
The column names are missing (could get them by using the Boston dataset loaded from the MASS library).
\(~\)
org_train = train_data
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)
\(~\)
Just checking out one hidden layer with 5 units to get going.
library(nnet)
fit5 <- nnet(train_targets ~ ., data = train_data, size = 5, linout = TRUE,
maxit = 1000)
## # weights: 76
## initial value 222875.445564
## iter 10 value 9625.869446
## iter 20 value 7008.319688
## iter 30 value 5889.493080
## iter 40 value 4914.814874
## iter 50 value 4116.546650
## iter 60 value 3860.485941
## iter 70 value 3705.511019
## iter 80 value 3660.347303
## iter 90 value 3616.529740
## iter 100 value 3566.523287
## iter 110 value 3549.835275
## iter 120 value 3519.659951
## iter 130 value 3450.648101
## iter 140 value 3359.604509
## iter 150 value 3313.298120
## iter 160 value 3281.887294
## iter 170 value 3256.471450
## iter 180 value 3239.303308
## iter 190 value 3205.304637
## iter 200 value 3180.026739
## iter 210 value 3166.982443
## iter 220 value 3165.194791
## iter 230 value 3164.924614
## iter 240 value 3164.431322
## iter 250 value 3164.319839
## iter 250 value 3164.319818
## iter 250 value 3164.319818
## final value 3164.319818
## converged
summary(fit5)
## a 13-5-1 network with 76 weights
## options were - linear output units
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1
## -9.54 14.52 39.89 3.83 92.85 -159.44 111.11 -96.54 -112.98
## i9->h1 i10->h1 i11->h1 i12->h1 i13->h1
## -19.04 -73.75 -46.80 24.26 21.85
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2
## -2.01 -0.54 0.14 0.03 0.07 -0.24 0.12 -0.06 -0.58
## i9->h2 i10->h2 i11->h2 i12->h2 i13->h2
## 0.85 0.03 -0.13 0.27 -1.08
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 i8->h3
## 101.33 8.06 84.39 -4.25 26.14 0.96 14.74 -66.93 -19.00
## i9->h3 i10->h3 i11->h3 i12->h3 i13->h3
## -64.05 -3.34 -15.21 13.91 -18.71
## b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4 i8->h4
## -263.73 20.74 2.84 -49.76 21.23 81.77 127.22 -30.61 10.17
## i9->h4 i10->h4 i11->h4 i12->h4 i13->h4
## 48.43 -35.78 -33.23 29.71 -111.64
## b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5 i8->h5
## -277.75 -61.39 41.17 -20.95 9.33 -79.87 25.11 76.91 -74.78
## i9->h5 i10->h5 i11->h5 i12->h5 i13->h5
## 89.30 99.20 -29.04 -40.58 -210.38
## b->o h1->o h2->o h3->o h4->o h5->o
## 10.97 3.03 25.68 5.50 7.12 8.62
pred = predict(fit5, newdata = test_data, type = "raw")
sqrt(mean((pred[, 1] - test_targets)^2))
## [1] 5.695772
mean(abs(pred[, 1] - test_targets))
## [1] 3.490978
\(~\)
We now focus on the different elements of neural networks.
\(~\)
\(~\)
These choices have been guided by solutions in statistics (multiple linear regression, logistic regression, multiclass regression)
\(~\)
\(~\)
Remark: it is important that the output activation is matched with an appropriate loss function.
\(~\)
Network architecture contains three components:
\(~\)
- Width: How many nodes are in each layer of the network?
- Depth: How deep is the network (how many hidden layers)?
- Connectivity: How are the nodes connected to each other?
\(~\)
This depends on the problem, and here experience is important.
\(~\)
However, the recent practice, see e.g. Chollet and Allaire (2018), Section 4.5.6, and Goodfellow, Bengio, and Courville (2016), page 229, is to choose a sufficiently large network and control overfitting with regularization.
The choice of network architecture then reduces to choosing a large enough network. See also Hyperparameter optimization below.
The method part is to choose the loss function for the output layer.
The choice of loss function is closely related to the output layer activation function. To sum up, the popular problem types, output activation and loss functions are:
| Problem | Output activation | Loss function |
|---|---|---|
| Regression | linear | mse |
| Classification (C=2) | sigmoid | binary_crossentropy |
| Classification (C>2) | softmax | categorical_crossentropy |
where the mathematical formulas are given for \(n\) training samples.
Let all network parameters (weights) be denoted by \({\boldsymbol \theta}\).
\[ J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n (y_i-{\hat{y}_1({\boldsymbol x}_i)})^2\] where \(\hat{y}_1({\boldsymbol x}_i)\) is the output from the linear output node, and \(y_i\) is the response.
\[ J({\boldsymbol \theta})=-\frac{1}{n}\sum_{i=1}^n \Big( y_i\ln({\hat{y}_1({\boldsymbol x}_i)})+(1-y_i)\ln(1-{\hat{y}_1({\boldsymbol x}_i)})\Big)\]
where \(\hat{y}_1({\boldsymbol x}_i)\) is the output from the sigmoid output node, and \(y_i\) is the 0/1 observed class.
\[ J({\boldsymbol \theta})=-\frac{1}{n}\sum_{i=1}^n\frac{1}{C} \sum_{c=1}^C y_{ic}\ln({\hat{y}_c({\boldsymbol x}_i)})\] where \(\hat{y}_c({\boldsymbol x}_i)\) is the output from softmax output node \(c\) and \(y_{ic}\) is the observed one-hot coding (0/1) for class \(c\).
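The three loss functions can be written out directly, here as a sketch with our own toy values (the \(1/C\) factor in the multiclass loss is kept to match the formula above; keras' own categorical_crossentropy does not include it):

```r
# Sketch: the three loss functions J(theta), matching the formulas above (toy values, our own)
mse <- function(y, yhat) mean((y - yhat)^2)
binary_crossentropy <- function(y, yhat) -mean(y * log(yhat) + (1 - y) * log(1 - yhat))
# Y and Yhat are n x C matrices; Y holds the one-hot codings
categorical_crossentropy <- function(Y, Yhat) -mean(rowSums(Y * log(Yhat)) / ncol(Y))
mse(c(1, 2), c(1.1, 1.9))                         # 0.01
binary_crossentropy(c(1, 0), c(0.8, 0.3))
```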
Due to how estimation is done (see below), the loss functions chosen "need" to be differentiable with respect to the parameters, so that gradients can be computed.
In the 1980-1990s the mean squared error was the prominent loss function also for classification problems, but this has subsequently changed — motivated by the spread of maximum likelihood from statistics to machine learning.
Observe that we have only given the formula for the loss function, and not explicitly assumed anything about any probability distribution of the responses (not even assumed that the responses are random variables). However, we know which statistical model assumptions would give the loss functions as related to the negative of the loglikelihood.
Let the unknown parameters be denoted \({\boldsymbol \theta}\) (what we have previously denoted as \(\alpha\)s and \(\beta\)s), and the loss function to be minimized \(J({\boldsymbol \theta})\).
Illustration: Figure 1.9 from Chollet and Allaire (2018) (again).
Let \(\nabla J({\boldsymbol \theta}^{(t)})\) be the gradient of the loss function evaluated at the current estimate \({\boldsymbol \theta}^{(t)}\). At step \(t+1\) of the algorithm, the new values (estimates) of the parameters are given as:
\[{\boldsymbol \theta}^{(t+1)}={\boldsymbol \theta}^{(t)} - \lambda \nabla_{\boldsymbol \theta} J({\boldsymbol \theta}^{(t)})\]
The learning rate \(\lambda\) is often set to some small value. In keras the default learning rate is \(0.01\).
Remember that the gradient is the vector of partial derivatives of the loss function with respect to each of the parameters in the network.
Q: Why are we moving in the direction of the negative of the gradient? Why not the positive?
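The update rule above can be sketched in a few lines, here for a one-parameter least-squares problem (the toy data, starting value and learning rate are our own choices):

```r
# Sketch: gradient descent for the MSE loss of a one-parameter linear model (toy data, our own)
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.1)
w <- 0                                     # initial estimate theta^(0)
lambda <- 0.1                              # learning rate
for (t in 1:200) {
  grad <- mean(-2 * x * (y - w * x))       # gradient of J(w) = mean((y - w*x)^2)
  w <- w - lambda * grad                   # the update rule above
}
w                                          # close to the true slope 2
```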
The loss function is computed as a mean over all training examples.
\[J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n J({\boldsymbol x}_i, y_i)\]
This means that the gradient will also be a mean over the gradient contribution from each training example.
\[\nabla_{\boldsymbol \theta} J({\boldsymbol \theta})=\frac{1}{n}\sum_{i=1}^n \nabla_{\boldsymbol \theta} J({\boldsymbol x}_i, y_i)\]
To build a network that generalizes well, it is important to have many training examples, but that would make us spend a lot of time and computer resources on calculating each gradient descent step.
We may see the gradient as an average over many individual gradients, and think of this as an estimator for an expectation. This expectation can (also) be approximated by the average gradient over just a mini-batch (random sample) of the observations.
The idea here is that the optimizer will converge much faster if it can rapidly compute approximate estimates of the gradient, instead of slowly computing the exact gradient (using all training data).
In addition, on multicore systems mini-batches may be processed in parallel, and the batch size is often a power of 2 (e.g. 32 or 256).
It also turns out that small batches serve as a form of regularization, maybe due to the variability they bring to the optimization process.
In the 4th video (on backpropagation) from 3Blue1Brown there is a nice example of one trajectory from gradient descent and one from SGD (13:50 minutes into the video): https://www.youtube.com/watch?v=tIeHLnjs5U8
This means that for (mini-batch) stochastic gradient descent we do as follows: draw a random mini-batch of training observations, compute the average gradient over this mini-batch, take a gradient step using this approximate gradient, and repeat (a full pass through the training data is called an epoch).
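Mini-batch SGD can be sketched for a toy one-parameter linear model (the data, batch size and learning rate are our own choices):

```r
# Sketch: mini-batch SGD for a one-parameter linear model (toy data, our own)
set.seed(42)
x <- rnorm(200)
y <- 2 * x + rnorm(200, sd = 0.1)
w <- 0
lambda <- 0.1
batch_size <- 32
for (epoch in 1:50) {
  idx <- sample(length(x))                                   # reshuffle every epoch
  batches <- split(idx, ceiling(seq_along(idx) / batch_size))
  for (b in batches) {
    grad <- mean(-2 * x[b] * (y[b] - w * x[b]))              # gradient over the mini-batch only
    w <- w - lambda * grad
  }
}
w   # close to the true slope 2, despite only approximate gradients
```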
Computing the analytical expression for the gradient \(\nabla J\) is not difficult, but evaluating it numerically may be expensive. The backpropagation algorithm is a simple and inexpensive way to calculate the gradient.
The chain rule is used to compute derivatives of compositions of functions whose derivatives are known; backpropagation applies it efficiently.
Backpropagation starts with the value of the loss function and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter has to the loss value.
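The chain rule behind backpropagation can be seen in a tiny one-weight "network" (all values below are our own toy choices); the analytical gradient agrees with a numerical derivative:

```r
# Sketch: the chain rule that backpropagation applies, for one weight (toy values, our own)
sigma <- function(z) 1 / (1 + exp(-z))
x <- 1.5; y <- 1; w <- 0.3
# forward pass
z <- w * x
yhat <- sigma(z)
loss <- (y - yhat)^2
# backward pass: dJ/dw = dJ/dyhat * dyhat/dz * dz/dw
grad <- (-2 * (y - yhat)) * (yhat * (1 - yhat)) * x
# check against a central-difference numerical derivative
h <- 1e-6
grad_num <- ((y - sigma((w + h) * x))^2 - (y - sigma((w - h) * x))^2) / (2 * h)
c(grad, grad_num)   # the two agree
```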
More information:
General concept:
Named variants: In keras the “default optimizer” is RMSprop.
Goodfellow, Bengio, and Courville (2016), Chapter 7, define regularization as any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
We looked at regularization in Module 6, where our aim was to trade increased bias for reduced variance. Another way of looking at this is that we need not focus entirely on finding the model of "correct size", but can instead find a large model and regularize it properly.
In Module 6 we looked in particular at adding a penalty to the loss function. The penalties we looked at were the absolute value of the parameters (\(L_1\), lasso, which we viewed as model selection) and the squared value of the parameters (\(L_2\), ridge regression). This can also be done for neural networks.
In neural networks, weight decay is the expression for adding a \(L_2\)-penalty to the loss function, and is available in the nnet R package.
Other versions of regularization are dataset augmentation and label smoothing:
Dataset augmentation means adding fake data to the dataset, so that the trained model will generalize better. For some learning tasks it is straightforward to create the new fake data; for image data this can be done by rotating and scaling the images.
Label smoothing is motivated by the fact that the training data may contain errors in the recorded responses, and replaces the one-hot coding for \(C\) classes with \(\epsilon/(C-1)\) and \(1-\epsilon\) for some small \(\epsilon\).
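Label smoothing is a one-line transformation of the one-hot matrix. A sketch with our own choices of \(\epsilon=0.1\) and \(C=3\):

```r
# Sketch: label smoothing of one-hot codings, epsilon = 0.1, C = 3 (our own choices)
smooth_labels <- function(onehot, eps = 0.1) {
  C <- ncol(onehot)
  onehot * (1 - eps) + (1 - onehot) * eps / (C - 1)
}
Y <- diag(3)[c(1, 3, 2), ]   # one-hot rows for observed classes 1, 3, 2
smooth_labels(Y)             # 1 -> 0.9 and 0 -> 0.05; each row still sums to 1
```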
(Based on Goodfellow, Bengio, and Courville (2016), Section 7.8)
The most commonly used form of regularization is early stopping.
If we have chosen a sufficiently large model with the capacity to overfit the training data, we would observe that the training error decreases steadily during training, but the error on the validation set at some point begins to increase.
If we stop the learning early and return the parameters giving the best performance on the validation set, this model will hopefully be better than if we had trained the model until convergence, and it will then also give a better test set error.
It is possible to think of the number of training steps as a hyperparameter. This hyperparameter can easily be set, and the cost is just that we need to monitor the performance on the validation set during training. Alternatively, cross-validation can be used.
One strategy is to first find the optimal stopping time for the training based on the validation set (or with cross-validation for small data sets), and then retrain on the full training set (including the validation part), stopping at the selected stopping time.
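Reading off the stopping time from a monitored validation-loss curve is a one-liner (the per-epoch values below are hypothetical, our own):

```r
# Sketch: picking the early-stopping epoch from a recorded validation-loss curve
# (hypothetical per-epoch values, our own)
val_loss <- c(1.00, 0.70, 0.52, 0.45, 0.43, 0.44, 0.47, 0.52)
best_epoch <- which.min(val_loss)
best_epoch   # stop here and keep the weights from this epoch
```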
Why is early stopping a regularization technique? By stopping early, the optimization procedure has only seen a relatively small part of the parameter space in the neighbourhood of the initial parameter value. See Goodfellow, Bengio, and Courville (2016), page 250, for a relationship with \(L_2\) regularization.
How to avoid overfitting:
The network architecture, the number of batches to run before terminating the optimization, and the drop-out rate are all examples of hyperparameters that need to be chosen sensibly before fitting the final model.
It is important that the hyperparameters are chosen on a validation set or by cross-validation.
However, a “popular” term is validation-set overfitting and refers to using the validation set to decide many hyperparameters, so many that you may effectively overfit the validation set.
In statistics we use design of experiments to explore such hyperparameters jointly, while in machine learning it is common to use marginal grids (one hyperparameter at a time).
Example on DOE for hyperparameter optimization (which of course can also be used for neural networks). Article: Gustavo A. Lujan-Moreno, Phillip R. Howard, Omar G. Rojas, Douglas Montgomery (2018). Design of experiments and response surface methodology to tune machine learning hyperparameters, with a random forest case-study. Expert Systems with Applications, Volume 109. https://doi.org/10.1016/j.eswa.2018.05.024
\(~\)
(Based on Goodfellow, Bengio, and Courville (2016), Section 7.12, and Chollet and Allaire (2018) 4.4.3)
Dropout was developed by Geoff Hinton and his students.
Alternatively, the drop-out and scaling (now upscaling) can be done during training.
One way to look at dropout is along the lines of what we did in Module 8, where we used bootstrapping to produce many data sets, fitted a model to each of them, and then averaged the predictions (bagging). Randomly dropping outputs in a layer can be seen as mimicking bagging, in an efficient way.
See Goodfellow, Bengio, and Courville (2016), Section 7.12 for more insight into the mathematics behind drop-out.
The following is a direct quotation.
figplucker: How was ‘Dropout’ conceived? Was there an ‘aha’ moment?
geoffhinton (2 years ago)
There were actually three aha moments. One was in about 2004 when Radford Neal suggested to me that the brain might be big because it was learning a large ensemble of models. I thought this would be a very inefficient use of hardware since the same features would need to be invented separately by different models. Then I realized that the “models” could just be the subset of active neurons. This would allow combinatorially many models and might explain why randomness in spiking was helpful.
Soon after that I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.
I tried this out rather sloppily (I didn’t have an adviser) in 2004 and it didn’t seem to work any better than keeping the squared weights small so I forgot about it.
Then in 2011, Christos Papadimitriou gave a talk at Toronto in which he said that the whole point of sexual reproduction was to break up complex co-adaptations. He may not have said it quite like that, but that’s what I heard. It was clearly the same abstract idea as randomly removing subsets of the neurons. So I went back and tried harder and in collaboration with my grad students we showed that it worked really well.
Metrics provide different ways to measure how well the predictions agree with the true values.

- accuracy: percentage of correct classifications
- mae: mean absolute error

These metrics can be monitored during training (on a validation set) or at the end on the test set, and can be the basis for deciding when to stop training.
(based on Chollet and Allaire (2018))
Neural networks were investigated in "toy form" in the 1950s. The first big step was taken in the 1980s, when the backpropagation algorithm was developed (rediscovered) to perform gradient descent in an efficient way.
In 1989 Yann LeCun (Bell Labs) used convolutional neural networks to classify handwritten digits, and LeNet was used by the US Postal Service for reading ZIP codes in the 1990s.
Relatively little activity was seen in the neural network field in the 2000s.
In 2011 neural networks with many layers (and trained with GPUs) were performing well on image classification tasks. The ImageNet classification challenge (classify high resolution colour images into 1k different categories after training on 1.4M images) was won by solutions with deep convolutional neural networks (convnets). In 2011 the accuracy was 74.3%, in 2012 83.6% and in 2015 96.4%.
Since 2012, convnets have been the general solution for computer vision tasks. Other application areas include natural language processing.
Deep learning does not mean a deeper understanding, but refers to successive layers of representations, where the number of layers gives the depth of the model. Often tens to hundreds of layers are used.
Deep neural networks are not seen as models of the brain, and are not related to neurobiology.
A deep network can be seen as many stages of information distillation (Chollet and Allaire (2018), page 9), where each stage performs a simple data transformation. These transformations are not curated by the data analyst, but are learned by the network.
In statistics we first select a set of inputs, then look at how these inputs should be transformed (projections in simple form or high-dimensional and nonlinear forms), before we apply some statistical methods. This transformation step can be called feature engineering and has been automated in deep learning.
Deep Learning is an algorithm which has no theoretical limitations of what it can learn; the more data you give and the more computational time you provide, the better it is.
Geoffrey Hinton (Google)
In addition, this built-in feature engineering of the deep network is not performed in a greedy fashion, but jointly with estimating/learning the full model.
The success of deep learning is dependent upon the breakthroughts in hardware development, expecially with faster CPUs and massively parallell graphical processing units (GPUs). Tensor processing units (TPUs) is the next step.
Achievements of deep learning includes high quality (near-human to super human) image classification, speech recognition, handwriting transcription, machine translation, digital assistants, autonomous driving, advertise targeting, web searches, playing Go and chess.
Earlier, good C++ programming skills were essential for working in deep learning, in addition to skills in programming for GPUs (e.g. the NVIDIA CUDA programming interface). With the launch of the Keras library, users may now only need basic skills in Python or R.
Keras can be seen as the LEGO bricks of deep learning. To quote the web page:
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
TensorFlow is a symbolic tensor-manipulation framework that can also perform automatic differentiation.
A tensor is defined by its number of axes, its shape (dimensions) and its data type (double, integer, character).
Tensor operations (reshaping, dot product) are performed in TensorFlow.
The use of TensorFlow in R Keras is referred to as “using TensorFlow as a backend engine”. TensorFlow is the default backend.
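As an illustration of axes, shape and data type, base R arrays behave like the tensors described above. This is a sketch using only base R; nothing here is keras-specific:

```r
# A rank-3 tensor as a base R array with 3 axes (samples, height, width).
# The shape below mimics a small batch of 28 x 28 greyscale images.
x <- array(0, dim = c(5, 28, 28))

length(dim(x))  # number of axes: 3
dim(x)          # shape: 5 28 28
typeof(x)       # data type: "double"
```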
More information on the R solution: https://keras.rstudio.com/
Cheat-sheet for R Keras: https://github.com/rstudio/cheatsheets/raw/master/keras.pdf
Chollet and Allaire (2018), Section 4.5, has the following recommendations for steps 3 and 5:
Here is a tutorial: https://keras.rstudio.com/articles/tutorial_overfit_underfit.html
This is a larger-image version of the handwritten digits data set (a different version, the ZIP-codes data, is found under Recommended exercises).
This data analysis is based on https://www.math.ntnu.no/emner/TMA4268/2018v/11NN/8-neural_networks_mnist.html and the R keras cheat sheet. An advanced version using convolutional neural nets is found here: https://www.math.ntnu.no/emner/TMA4268/2018v/11NN/12-neural_networks_convolution_mnist.html
Objective: classify the digit contained in an image (28 \(\times\) 28 greyscale). Problem type: multiclass classification based on image data.
Labels for the training data:
## train_labels
## 0 1 2 3 4 5 6 7 8 9
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
60 000 images for training and 10 000 images for testing.
library(keras)
mnist <- dataset_mnist()  # loads (and on first use downloads) the MNIST data

# Training data
train_images <- mnist$train$x
train_labels <- mnist$train$y
# Test data
test_images <- mnist$test$x
test_labels <- mnist$test$y
org_testlabels <- test_labels
The train_images object is a tensor (a generalization of a matrix) with 3 axes: (samples, height, width).
In this case we use layer_dense (fully connected layers), which expects an input tensor of rank two, (samples, features), where each sample contains 28*28 = 784 pixels. A bias term (intercept) is included by default in layer_dense.
network <- keras_model_sequential() %>%
    layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
    layer_dense(units = 10, activation = "softmax")
summary(network)
## Model: "sequential"
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense (Dense) (None, 512) 401920
## ___________________________________________________________________________
## dense_1 (Dense) (None, 10) 5130
## ===========================================================================
## Total params: 407,050
## Trainable params: 407,050
## Non-trainable params: 0
## ___________________________________________________________________________
As activation function we use relu for the hidden layer and softmax for the output layer, since we have 10 classes (of which exactly one is correct each time). We could have included more layers in our model, and then perhaps used early stopping if the model turned out to be too large.
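For intuition, the two activation functions can be written out in a few lines of base R (a sketch only, not the keras implementation):

```r
# relu: used in the hidden layer; zeroes out negative inputs.
relu <- function(x) pmax(x, 0)

# softmax: used in the output layer; maps scores to class probabilities.
softmax <- function(z) exp(z) / sum(exp(z))

relu(c(-1, 0.5, 2))      # 0.0 0.5 2.0
p <- softmax(c(1, 2, 3))
sum(p)                   # 1: a valid probability distribution
```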
We next need to choose the loss function, here categorical cross-entropy, and the optimizer, here RMSprop. Finally, which measure (metric) do we want to monitor during training? Here we choose accuracy (the percentage of correctly classified images).
network %>% compile(optimizer = "rmsprop", loss = "categorical_crossentropy",
metrics = c("accuracy"))
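To make the loss concrete, categorical cross-entropy for a single observation can be sketched in base R (keras computes this internally; the numbers below are made up for illustration):

```r
# y: one-hot vector for the true class; p: predicted class probabilities.
cross_entropy <- function(y, p) -sum(y * log(p))

y <- c(0, 0, 1)        # the true class is the third one
p <- c(0.1, 0.2, 0.7)  # predicted probabilities
cross_entropy(y, p)    # -log(0.7), approximately 0.357
```

Note that only the predicted probability of the true class enters the sum, so confident correct predictions give a loss near zero.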
The training data is stored in an array of dimension (60000, 28, 28) of type integer, with values in the [0, 255] interval. The data must be reshaped into the shape the network expects (28*28 = 784 features per sample). In addition, the grey-scale values are scaled to lie in the [0, 1] interval.
Also, the responses must be transformed from the labels 0-9 to vectors of 0s and 1s (dummy variable coding), also known as one-hot coding.
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images/255
train_labels <- to_categorical(train_labels)
test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images/255
test_labels <- to_categorical(test_labels)
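What to_categorical does can be reproduced in a few lines of base R (a sketch; the helper name one_hot is ours):

```r
# Map labels 0-9 to rows of a 10-column indicator (one-hot) matrix.
one_hot <- function(labels, num_classes = 10) {
    m <- matrix(0, nrow = length(labels), ncol = num_classes)
    m[cbind(seq_along(labels), labels + 1)] <- 1
    m
}

one_hot(c(0, 3, 9))  # a single 1 in columns 1, 4 and 10
```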
### 5. Fit the model
fitted <- network %>% fit(train_images, train_labels,
    epochs = 30, batch_size = 128)
library(ggplot2)
plot(fitted)+ggtitle("Fitted model")
### 6. Evaluation and prediction
network %>% evaluate(test_images, test_labels)
testres = network %>% predict_classes(test_images)
## $loss
## [1] 0.1194063
## $acc
## [1] 0.9827
confusionMatrix(factor(testres), factor(org_testlabels))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 971 0 3 0 1 2 5 1 1 1
1 1 1128 2 0 0 0 2 2 2 3
2 1 1 1009 3 3 0 2 7 2 0
3 0 1 2 997 0 11 1 3 4 3
4 1 0 2 0 969 1 4 2 3 7
5 0 1 0 1 0 871 3 0 1 4
6 3 2 2 0 3 4 940 0 1 0
7 1 1 4 2 1 0 0 1002 2 3
8 2 1 7 0 0 2 1 4 953 1
9 0 0 1 7 5 1 0 7 5 987
Overall Statistics
Accuracy : 0.9827
95% CI : (0.9799, 0.9852)
No Information Rate : 0.1135
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9808
Mcnemar's Test P-Value : NA
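As a check, the reported accuracy of 0.9827 can be recovered from the confusion matrix above: correct classifications sit on the diagonal, and there are 10 000 test images.

```r
# Diagonal of the confusion matrix above (correctly classified counts).
diag_counts <- c(971, 1128, 1009, 997, 969, 871, 940, 1002, 953, 987)
sum(diag_counts) / 10000  # 0.9827, matching evaluate() on the test set
```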
Kaggle has hosted machine-learning competitions since 2010, and by looking at the winning solutions it is possible to get an overview of what works. In 2016-2017, gradient boosting methods won the competitions with structured data (“shallow” learning problems), while deep learning won perceptual problems (such as image classification), Chollet and Allaire (2018) (page 18). Kaggle has helped (and is still helping) the rise of deep learning.
This example is taken from Chollet and Allaire (2018): https://www.math.ntnu.no/emner/TMA4268/2018v/11NN/11-neural_networks_boston_housing.html
Activation functions (relu, linear, sigmoid) and inner products, combined in sequentially connected layers.
nnet in R.
keras in R: use of tensors; piping sequential layers, piping to estimation and then to evaluation (metrics).
(The first two of the topics below are covered in Chollet and Allaire (2018).)
Recurrent networks: extend the feedforward network to also have feedback connections. This is a popular type of network for analysing time series data and for natural language applications.
Convolutional networks: some layers in a sequential network contain operations specially suitable for grid-like topology (images). Convolution is used in place of general layer (where we do matrix multiplication) in at least one layer. A popular operation is pooling.
Explainable AI (XAI): how to use methods on the network, or on its predictions, to figure out the underlying reasoning of the network. Popular methods include LIME (local linear regression) and Shapley values (a concept from game theory). The DALEX R package contains different so-called explainers: https://arxiv.org/abs/1806.08915.
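The convolution and pooling operations mentioned for convolutional networks can be sketched in base R on a tiny 4 x 4 “image”. This is a toy illustration only: deep learning libraries actually compute cross-correlation (which is what conv2d below does), and the kernel values are arbitrary.

```r
# Valid 2D convolution (cross-correlation) of an image with a kernel.
conv2d <- function(img, k) {
    n <- nrow(img) - nrow(k) + 1
    m <- ncol(img) - ncol(k) + 1
    out <- matrix(0, n, m)
    for (i in 1:n) for (j in 1:m)
        out[i, j] <- sum(img[i:(i + nrow(k) - 1), j:(j + ncol(k) - 1)] * k)
    out
}

# 2x2 max pooling: keep the largest value in each 2x2 block.
max_pool2 <- function(x) {
    n <- nrow(x) %/% 2
    m <- ncol(x) %/% 2
    out <- matrix(0, n, m)
    for (i in 1:n) for (j in 1:m)
        out[i, j] <- max(x[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])
    out
}

img <- matrix(1:16, 4, 4)
k <- matrix(c(1, 0, 0, -1), 2, 2)  # arbitrary 2x2 example kernel
feat <- conv2d(img, k)             # 3x3 feature map
max_pool2(feat)                    # pooled down to 1x1
```

In a convnet these two operations replace the matrix multiplication of a dense layer in at least one layer, and the kernel entries are learned during training.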
Chollet, François, and J. J. Allaire. 2018. Deep Learning with R. Manning Press. https://www.manning.com/books/deep-learning-with-r.
Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference - Algorithms, Evidence, and Data Science. Cambridge University Press.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.